Graph-based extractive text summarization method for Hausa text

Automatic text summarization is one of the most promising solutions to the ever-growing challenges of textual data, as it produces a shorter version of the original document, with fewer bytes, that conveys the same information as the original. Despite the advancements in automatic text summarization research, the development of automatic text summarization methods for documents written in Hausa, a Chadic language widely spoken in West Africa by approximately 150 million people as either their first or second language, is still in its early stages. This study proposes a novel graph-based extractive single-document summarization method for Hausa text by modifying the existing PageRank algorithm to use the normalized count of bigrams common to adjacent sentences as the initial vertex score. The proposed method is evaluated with the ROUGE evaluation toolkit on a Hausa summarization evaluation dataset, collected specifically for this study, comprising 113 Hausa news articles. The proposed approach outperformed the standard methods on the same dataset: it outperformed the TextRank method by 2.1%, LexRank by 12.3%, the centroid-based method by 19.5%, and the BM25-TextRank method by 17.4%.


Introduction
Automatic text summarization (ATS) produces a shorter version of the original document that has a smaller digital size in bytes yet retains the same information as the original. This process reduces large documents to a concise representation that facilitates reading and comprehension by humans. ATS is one of the most promising solutions to the current challenge of information overload [1]. The technique is necessary because the amount of textual data grows continuously [2], which makes searching for the required information difficult and time-consuming. ATS has diverse applications in natural language processing (NLP) and information extraction [3], including search engines [4], news summarization [5][6][7], social post summarization [8,9], sentiment analysis [10,11], product reviews [12,13], and image captioning [14].

Earlier graph-based approaches have considered both inter- and intra-document diversity to determine the similarity between sentences; Wang and Liu [53] applied a random walk algorithm to an affinity graph for multi-document summarization (MDS) to impose diversity. Similarly, AlZahir and Fatima [54] used a multigraph model to represent text for extractive summarization, and Ullah and Al Islam [55] proposed a semantic graph-based model for extractive text summarization that first extracts the predicate argument structure (PAS) of sentences, which is then used to measure the semantic similarity between them. Graph-based text summarization methods have been proposed for several languages: Arabic [56][57][58], Serbian [59], Bengali [60], Malayalam [61], Indonesian [34], Chinese [62], and Amharic [63]. However, despite the advancements in ATS research, work on ATS methods for documents written in Hausa, a Chadic language widely spoken in West Africa by approximately 150 million people as either their first or second language, is still in its early stages. Hausa is widely spoken in Northern Nigeria, the southern Niger Republic, and parts of Cameroon and Ghana, among others. No graph-based ATS method has been applied to the Hausa language, and the only method proposed for Hausa ATS to date [64] is a machine-learning approach based on a Naïve Bayes classifier, trained and tested on only ten Hausa news articles.

In this study, a graph-based ATS method is proposed for extractive single-document summarization (SDS) of Hausa text by modifying the existing PageRank algorithm to use normalized common bigram counts between sentences as initial vertex scores. The proposed method uses an undirected weighted graph to represent the text: sentences are represented as graph vertices, and the edges between vertices are determined by the similarity between sentences, measured using cosine similarity.
The remainder of this paper is organized as follows. Section 2 discusses the materials and methods. Section 3 describes the proposed method in detail. Section 4 describes the dataset, details the experimental results, and presents the discussion. Section 5 presents the conclusions and directions for future work.

Materials and methods
The proposed graph-based ATS method for Hausa text comprises four main phases: text preprocessing, similarity calculation and graph construction, sentence ranking, and sentence selection, as illustrated in Fig 1. The input to the system is raw Hausa text, which is preprocessed to clean and prepare it for the subsequent stages. An undirected weighted graph is then constructed for the text: sentences are represented as graph vertices, and the edges between vertices are determined by the similarity between the text sentences. Finally, the proposed ranking algorithm is applied to the graph to determine the final ranks of the vertices.

Text preprocessing
The input text is unstructured natural language and must therefore be transformed into a structured format. Preprocessing starts with case folding to convert all letters of the document to lowercase; the text is then segmented into individual sentences, which are subsequently tokenized into collections of words without punctuation. The Python NLTK library is used for both sentence segmentation and tokenization. In Hausa text, as in English, sentences are identified by a period "." or colon ":" marking their end, and words are separated by spaces. A Hausa stemmer [65] was used to normalize words to their stem form, and stop words were removed for better scoring accuracy, using a list of Hausa stop words [66]; punctuation, non-letters, and other special characters were removed from the input text documents. Consider the following Hausa sentence: "Abubakar ya na karatu a Jamiar UTM." Using a space as the separator between tokens, the sentence is tokenized as "Abubakar," "ya," "na," "karatu," "a," "Jamiar," and "UTM." The words "ya," "na," and "a" are stop words according to the list [66], leaving only "Abubakar," "karatu," "Jamiar," and "UTM," and the word "Jamiar" is stemmed to "Jamia" by the stemmer [65].
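As an illustration, a minimal Python sketch of this pipeline is shown below. The `hausa_stem` function and the `HAUSA_STOPWORDS` set are stand-ins for the actual Hausa stemmer [65] and stop-word list [66], which are not reproduced here; only the entries needed for the running example are included.

```python
# Minimal preprocessing sketch; requires nltk and its "punkt" tokenizer data.
import string
from nltk.tokenize import sent_tokenize, word_tokenize

HAUSA_STOPWORDS = {"ya", "na", "a"}   # illustrative subset of the list in [66]
STEMS = {"jamiar": "jamia"}           # stand-in lookup; the real stemmer [65]
                                      # applies affix-stripping rules

def hausa_stem(word):
    """Placeholder for the Hausa stemmer [65]."""
    return STEMS.get(word, word)

def preprocess(text):
    """Case folding, sentence segmentation, tokenization, stop-word removal,
    and stemming, as described above."""
    processed = []
    for sentence in sent_tokenize(text.lower()):       # case folding + segmentation
        tokens = [t for t in word_tokenize(sentence)
                  if t not in string.punctuation]      # drop punctuation tokens
        processed.append([hausa_stem(t) for t in tokens
                          if t not in HAUSA_STOPWORDS])  # stop words + stemming
    return processed

print(preprocess("Abubakar ya na karatu a Jamiar UTM."))
# [['abubakar', 'karatu', 'jamia', 'utm']]
```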

Vector representation and graph construction
The processed text is represented as vectors of words using the term frequency (TF)-inverse document frequency (IDF) model. The text is modelled as a set $D = \{s_1, s_2, \ldots, s_n\}$, where $s_i$ is the $i$-th sentence in the document and $n$ is the number of sentences in $D$. Each sentence $s_i$ is represented as a vector of weights $s_i = (w_{i1}, w_{i2}, \ldots, w_{im})$, $i = 1, 2, \ldots, n$, where $w_{ik}$ is the weight of term $t_k$ in sentence $s_i$. In the field of information systems, there are different approaches to weighting schemes; however, term-weighting schemes have been described as the most widely used representations for extractive summarization approaches [67]. The inner product of any two sentences (represented as vectors) provides the similarity between them, as shown in Eq 1:

$$\mathrm{sim}(s_i, s_j) = \sum_{k=1}^{M} w_{ik} w_{jk} \qquad (1)$$

where $M$ is an integer representing the dimension of the vector space, equivalent to the number of terms in the document. The inner product is normalized by dividing it by the product of the vector lengths to obtain the cosine similarity between the sentences:

$$\cos(s_i, s_j) = \frac{\sum_{k=1}^{M} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{M} w_{ik}^2}\;\sqrt{\sum_{k=1}^{M} w_{jk}^2}}$$

The term frequency (TF) of a term $t$ in a sentence $s$ is its relative frequency of occurrence:

$$\mathrm{TF}(t, s) = \frac{f(t, s)}{\sum_{t' \in s} f(t', s)}$$

where $f(t, s)$ is the number of occurrences of $t$ in $s$. TFs are multiplied by the inverse document frequency (IDF) to overcome the challenge of domain words:

$$\mathrm{IDF}(t) = \log \frac{N}{n(t)}$$

where $N$ represents the total number of sentences in the document and $n(t)$ is the total number of sentences containing term $t$. A constant of value 1 is added to achieve a more even result:

$$\mathrm{IDF}(t) = \log \frac{N}{n(t)} + 1$$

The products of TF and IDF are denoted TF-IDF, and the model is known as the bag-of-words (BoW) model.
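The weighting and similarity computation can be sketched directly from these definitions. The following Python fragment is a minimal implementation written to match the formulas above (including the +1 IDF constant); the relative-frequency TF normalization is an assumption, since the exact normalization is not restated here.

```python
# TF-IDF sentence vectors and cosine similarity, following the equations above.
import math
from collections import Counter

def tfidf_vectors(sentences):
    """sentences: list of token lists, e.g. the output of preprocess().
    Assumes each token list is non-empty."""
    vocab = sorted({t for s in sentences for t in s})
    N = len(sentences)
    df = Counter(t for s in sentences for t in set(s))  # n(t): sentences containing t
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append([(tf[t] / len(s)) * (math.log(N / df[t]) + 1.0)
                        for t in vocab])                # TF * (log(N/n(t)) + 1)
    return vectors

def cosine(u, v):
    """Normalized inner product of two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```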
The text sentences are represented as graph vertices, and the adjacency matrix formed from the cosine similarities of the sentences is used to draw the edges of the graph. A similarity measure determines the weights of the edges, such that the weights are proportional to the strength of the relation between sentences. The presence or absence of an edge is determined by the value of the weights in the adjacency matrix: an edge between two sentences is included if their adjacency value is at least 0.5, following Mihalcea and Tarau [42].
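A short sketch of this graph construction is shown below; it reuses `cosine` from the previous fragment, and the 0.5 threshold follows the text above.

```python
# Undirected weighted graph as an adjacency matrix, keeping only edges
# whose cosine similarity reaches the 0.5 threshold.
import numpy as np

def build_graph(vectors, threshold=0.5):
    n = len(vectors)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine(vectors[i], vectors[j])
            if w >= threshold:
                adj[i, j] = adj[j, i] = w   # symmetric: the graph is undirected
    return adj
```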

Proposed ranking algorithm
This paper presents a modified PageRank algorithm for ranking sentences in Hausa text for extractive ATS. The PageRank algorithm is a ranking algorithm originally proposed for webpage analysis; it is based on the premise that the importance of a webpage is determined by the number and relative importance of the pages linking to it. The pages are modelled as a directed graph, and the page ranks are represented by a column-stochastic matrix. The ranks are then calculated iteratively, taking into account the ranks of incoming links.
Let $A$ denote the column-stochastic matrix and $v_i$ the rank vector at iteration $i$; the iteration $v_{i+1} = A\,v_i$ converges to a fixed vector $v^*$, known as the PageRank vector. By standard linear algebra, $v^*$ is an eigenvector of $A$ whose entries sum to 1. The rank of a node corresponds to the probability that a random walker visits that node; hence, the unique vector $v^*$ to which the sequence converges is the stationary distribution of the walk.
The ranking problem is a graph random-walk problem, which is a typical Markov chain transition problem. As in a Markov chain, an extreme condition can occur at a node with no outbound links, known as a dangling node. The original PageRank algorithm assigns a constant value of $1/n$ to a dangling vertex, where $n$ represents the total number of nodes in the graph. Hence, the transition matrix of the PageRank algorithm can be defined as:

$$M = (1 - d)\,A + \frac{d}{n}\,\mathbf{1}\mathbf{1}^{T}$$

where $A$ is the column-stochastic matrix defined above, $\mathbf{1}$ is the all-ones vector, and $d$ is the probability of discontinuing browsing the page. In addition to the frequent occurrence of words, the modified algorithm prioritizes phrase repetition, such that sentences containing common phrases have a higher probability of being selected for the summary. In this regard, the normalized count of bigrams common to adjacent sentences is used as the initial vertex score. A bigram estimates the probability of occurrence of a word based on the preceding word:

$$P(W_n \mid W_{n-1}) = \frac{C(W_{n-1} W_n)}{C(W_{n-1})}$$

where $W_n$ is the word considered, $W_{n-1}$ is the word preceding $W_n$, and $C(\cdot)$ counts occurrences. The concept of bigrams has been applied in various NLP tasks, such as speech recognition [68] and grammar suggestion [69]. Unigram models, such as the BoW model, disregard word order and context:

$$P(W_1, W_2, \ldots, W_n) = \prod_{i=1}^{n} P(W_i)$$

The count of the bigrams common to two sentences is calculated as follows:

$$\mathrm{Count}(s_i, s_j) = \left|\mathrm{bigrams}(s_i) \cap \mathrm{bigrams}(s_j)\right| \qquad (10)$$

Eq 10 is normalized by the total count of bigrams in the two sentences, and, using Laplace smoothing, a constant value of 1 is added to the numerator to avoid a zero bigram count:

$$B(s_i, s_j) = \frac{\left|\mathrm{bigrams}(s_i) \cap \mathrm{bigrams}(s_j)\right| + 1}{\left|\mathrm{bigrams}(s_i)\right| + \left|\mathrm{bigrams}(s_j)\right|}$$
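A minimal sketch of this bigram scoring, written from the description above; the names `bigrams` and `bigram_score` are illustrative.

```python
# Normalized count of bigrams common to two tokenized sentences (Eq 10,
# normalized and Laplace-smoothed as described above).
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def bigram_score(s_i, s_j):
    b_i, b_j = bigrams(s_i), bigrams(s_j)
    common = len(set(b_i) & set(b_j))    # bigrams shared by both sentences
    total = len(b_i) + len(b_j)          # total bigram count of the two sentences
    return (common + 1) / total if total else 0.0   # +1: Laplace smoothing
```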
The original PageRank algorithm can then be modified by replacing the uniform initial rank vector with these scores, so that each vertex $s_i$ starts with an initial score proportional to the bigrams it shares with its adjacent sentences:

$$v^{(0)}_i \propto \sum_{s_j \in \mathrm{adj}(s_i)} B(s_i, s_j)$$

normalized so that the entries of $v^{(0)}$ sum to 1. The ranking algorithm then recursively computes the rank of a vertex in terms of its adjacent vertices. Given that the transition matrix is column-stochastic, the Perron-Frobenius theorem guarantees that its dominant eigenvalue is 1, and by the power method convergence theorem the iteration on the $N \times N$ matrix converges, where $N$ is the total number of graph vertices. Because the number of sentences in a document is small, convergence is achieved in few iterations. Based on Langville and Meyer's analysis, the iteration process has a time complexity of $O(nm)$. The overall process of the proposed algorithm is summarized in Table 1.
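The following power-iteration sketch puts the pieces together. It reuses `bigram_score` and the adjacency matrix from the earlier fragments; the exact form of the initial-score normalization is an assumption reconstructed from the description above, and d = 0.15 is a conventional choice rather than a value stated here.

```python
# Modified PageRank sketch: standard column-stochastic transition matrix with
# damping, but with initial vertex scores taken from normalized common
# bigram counts between adjacent sentences (an assumed reconstruction).
import numpy as np

def rank_sentences(adj, sentences, d=0.15, iterations=100):
    n = len(sentences)
    col_sums = adj.sum(axis=0)
    safe = np.where(col_sums == 0, 1.0, col_sums)
    A = np.where(col_sums > 0, adj / safe, 1.0 / n)   # dangling columns -> 1/n
    M = (1 - d) * A + d / n                           # stop browsing with prob. d
    # Initial scores: bigram overlap with adjacent sentences (uniform fallback).
    v = np.array([sum(bigram_score(sentences[i], sentences[j])
                      for j in range(n) if adj[i, j] > 0) or 1.0
                  for i in range(n)])
    v /= v.sum()
    for _ in range(iterations):                       # power iteration
        v = M @ v
    return v
```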

Summary generation
The document sentences were sorted in descending order of their scores, and the highest-ranked sentences were selected and rearranged according to their original positions in the document. The number of sentences in the final summary (FN) was determined by the assigned summary compression ratio, calculated using Eq 17:

$$FN = CR \times |D_S| \qquad (17)$$

where $CR$ is the compression ratio and $|D_S|$ is the total number of sentences in the original input document.
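A compact sketch of this selection step; `compression_ratio` corresponds to CR in Eq 17.

```python
# Select the FN highest-scoring sentences and restore their original order.
def generate_summary(original_sentences, scores, compression_ratio=0.3):
    fn = max(1, round(compression_ratio * len(original_sentences)))   # Eq 17
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:fn]
    return " ".join(original_sentences[i] for i in sorted(top))
```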

Results and discussion
This section presents the corpus used for the experiments, the experiments conducted, and the results obtained. It also compares the performance of the proposed method with that of standard methods and provides a detailed discussion and analysis of the experimental results.

Evaluation metrics
The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [70], a recall-oriented, n-gram content-based summary measure, was used to evaluate the proposed method. ROUGE supports the comparison of a system summary with more than one reference summary; it was the first automatic summary evaluation tool proposed and remains the most commonly used [71]. ROUGE evaluates system-generated summaries using two metrics: precision and recall. Precision (P) is the ratio of the number of true positives to the sum of true positives and false positives, that is, the proportion of the n-grams in the system-generated summary that also appear in the reference summary:

$$P = \frac{|S_{\mathrm{sys}} \cap S_{\mathrm{ref}}|}{|S_{\mathrm{sys}}|}$$

Recall (R) is the ratio of the n-grams present in both the system-generated and reference summaries to the number of n-grams in the reference summary:

$$R = \frac{|S_{\mathrm{sys}} \cap S_{\mathrm{ref}}|}{|S_{\mathrm{ref}}|}$$

where $S_{\mathrm{sys}}$ and $S_{\mathrm{ref}}$ denote the n-grams of the system-generated and reference summaries, respectively. The harmonic average of recall and precision is called the F-score:

$$F = \frac{2PR}{P + R}$$
Three ROUGE variants were used in this study: ROUGE-1, ROUGE-2, and ROUGE-L. The ROUGE-1 metric compares the unigram overlap between the system-generated and reference summaries, and ROUGE-2 compares their bigram overlap. ROUGE-L stands for ROUGE longest common subsequence; it compares the system-generated and reference summaries using the LCS metric.
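For intuition, a hand-rolled ROUGE-1 computation is sketched below; the actual evaluation used the ROUGE toolkit [70], so this fragment only makes the precision, recall, and F-score definitions concrete at the unigram level.

```python
# ROUGE-1 sketch: clipped unigram overlap between system and reference summaries.
from collections import Counter

def rouge_1(system_tokens, reference_tokens):
    overlap = sum((Counter(system_tokens) & Counter(reference_tokens)).values())
    p = overlap / len(system_tokens) if system_tokens else 0.0
    r = overlap / len(reference_tokens) if reference_tokens else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```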

Experiment
To evaluate the performance of the proposed model, experiments were conducted with 100, 200, 300, 400, and 500 iterations, as listed in Table 3. The system-generated summaries were compared with gold-standard summaries using the ROUGE toolkit; for each metric, the average values of recall, precision, and F-score were recorded separately.

Comparison with standard methods
The performance of the proposed method was compared with that of selected standard extractive summarization methods on the same Hausa dataset: TextRank, LexRank, the centroid-based method, and BM25-TextRank. The TextRank method [42] was the first graph-based method for extractive summarization based on the concept of the PageRank algorithm; it represents document sentences as the vertices of an undirected weighted graph, with the edges determined by a measure of word overlap between sentences. LexRank [43] is a graph-based extractive summarization method that uses the concept of eigenvector centrality to determine sentence ranks. The centroid-based method [72] is an unsupervised text summarization method based on word embeddings that uses continuous vector representations to capture the semantic meaning of words. The BM25-TextRank method [73] combines TextRank with the BM25 ranking function, a probabilistic model used for ranking objects in information retrieval tasks. Table 4 and Fig 2 present the results of these experiments, detailed in the Discussion section: the average precision, recall, and F-scores under different numbers of iterations were compared, and the proposed method outperformed all four baseline methods on the same dataset for all metrics of ROUGE-1, ROUGE-2, and ROUGE-L.

Discussion
The experimental results showed that at 100 iterations, the proposed method outperformed the TextRank method by 8.5%, the LexRank method by 12.4%, the centroid-based method by 14.0%, and the BM25-TextRank method by 13.0% in average F-score. At 200 iterations, it outperformed TextRank by 11.1%, LexRank by 14.0%, the centroid-based method by 21.4%, and BM25-TextRank by 14.0%. At 300 iterations, it outperformed TextRank by 9.7%, LexRank by 10.7%, the centroid-based method by 18.9%, and BM25-TextRank by 17.9%. At 400 iterations, it outperformed TextRank by 0.6%, LexRank by 10.6%, the centroid-based method by 16.7%, and BM25-TextRank by 17.1%. At 500 iterations, it outperformed TextRank by 2.1%, LexRank by 12.3%, the centroid-based method by 19.5%, and BM25-TextRank by 17.4%. The performance of the methods improved with an increasing number of iterations but saturated after 500 iterations. These results and analyses show that the proposed method, an enhancement of the PageRank algorithm that uses the normalized common bigram count between adjacent sentences as the initial vertex score, outperforms the baseline methods on the same Hausa text summarization dataset.

Conclusion
This paper presents a novel graph-based extractive single-document summarization method for Hausa text. The method was designed by modifying the PageRank algorithm to use normalized common bigram counts between adjacent sentences as the initial vertex scores. Experimental results showed that the proposed method outperformed the baseline methods on the same dataset for all metrics of ROUGE-1, ROUGE-2, and ROUGE-L. The main contribution of this study is the introduction of a new ranking method for Hausa extractive text summarization. The proposed method is unsupervised and can also be applied to any language with lexical polysemy.
In future work, the following will be explored: extending the ranking method to multi-document extractive summarization by combining it with other techniques to reduce the redundancy associated with multi-document summarization, and evaluating the ranking method with other similarity measures to determine how its performance varies across similarity measures.